SCADIE: simultaneous estimation of cell type proportions and cell type-specific gene expressions using SCAD-based iterative estimating procedure

A challenge in bulk gene differential expression analysis is to differentiate changes due to cell type-specific gene expression and cell type proportions. SCADIE is an iterative algorithm that simultaneously estimates cell type-specific gene expression profiles and cell type proportions, and performs cell type-specific differential expression analysis at the group level. Through its unique penalty and objective function, SCADIE more accurately identifies cell type-specific differentially expressed genes than existing methods, including those that may be missed from single cell RNA-Seq data. SCADIE has robust performance with respect to the choice of deconvolution methods and the sources and quality of input data. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-022-02688-w).


S2 Supplementary Figures and Tables
: W and H accuracy over iteration for three W -update methods: NNLS, ridge regression, and SCAD-penalized regression. a. W 1 accuracy over iteration; b. W 2 accuracy over iteration; c. H 1 accuracy over iteration; d. H 2 accuracy over iteration.  Figure S2: Histograms showing the estimated W 2 {W 1 for known up-regulated and down-regulated DEG entries from three W -update methods: NNLS, ridge regression and SCAD-penalized regression. Sensitivity, specificity and positive predictive rate (PPV) are shown under each figure's title. Red vertical lines indicate the true fold change level. Due to the matrix-wise dissimilarity penalty in ridge regression, sensitivity from ridge regression is extremely low. Both independent W -update through NNLS and SCAD-penalty achieve high sensitivity, but SCAD has better false positive control (shown by higher PPV) due to its entry-specific precise penalty, this is also lined up with results from Figure 2. As the vast majority of entries are non-DEGs, specificity in this case is less meaningful than PPV. Iteration log_KL_Divergence e a b Figure S3: Benchmark cell type proportion estimations from SCADIE against DWLS, CIBERSORTx, MuSiC, and the naive iterative procedure with NNLS W-update: a. K-L Divergence between H and the ground truth proportions across three data sets, SCADIE and NNLS iteration's Hs were from the final iteration output, and Hs of DWLS and CIBERSORTx were directly from deconvolution; b. Same results as a) but measured by root-mean-square error (RMSE), the result patterns are consistent with those in K-L Divergence; ce: H accuracy over iterations, where figure b is simulation dataset, figure c is pseudo-bulk dataset, figure d is bulk microarray dataset.  Figure S4: A summary table comparing single cell-derived DEGs and SCADIE-inferred DEGs between COPD and control samples, the column " %DEGs Correct Direction from SCADIE " represents the percentages of single cell DEGs that were of concordant directional changes from bulk data inferred by SCADIE, the column "%SC Correctly Significant DEGs identified from SCADIE" represents the percentages of single cell DEGs that were also identified as significant DEGs from bulk data by SCADIE, the column " #Additional DEGs from SCADIE" represents the DEGs SCADIE identified from bulk data that were not identified by single cell DEG.  Figure S6: Sensitivities and false positive rates for up-regulated and downregulated DEGs over a range p values under initialization: "0" is using the ground truth H for initialization, "Ini" is the same SCADIE initialization used in Section "SCADIE can better identify DEGs", "Ini + 0.01" refers to additional sd=0.01 Gaussian noise was added to Ini's H and each column of H was re-balanced to 1, similar for "Ini + 0.05" where the sd = 0.05.

S3.1 Theoretical properties
Because NNLS is used when updating W and H, in this subsection, we develop a general theory for the NNLS estimate under some regularity conditions. Letŵ andw be the NNLS and Ordinary Least Squares (OLS) estimators, respectively. Even in the non-negative regression coefficient setting, we can empirically check that there is no guarantee that the error }ŵ´w o } is less than }w´w o }, where w o is the underlying non-negative coefficient vector. For example, we considered a linear regression model y " Xw o` , where m " 300, k " 5, entries in X's are i.i.d. N(0,1), and rw o s j 's are i.i.d. Uniform(0,5). We observed that }ŵ´w o } is greater than }w´w o } for about 30% of the cases. With this regard, the bound in Theorem 1 is not trivial.
Theorem 1. [General estimation error bound] Suppose that y " Xw o` , where y P R m is a response vector, X P R mˆk is observable and its columns have full-rank, and w o P R k , i.e., its components are non-negative. Suppose that κpX T Xq :" λ max pX T Xq{λ min pX T Xq ď M for some constant M ą 0. Let w be the non-negative least squares estimate, i.e., w " argmin wPR k }y´Xw} 2 .
Note that K " tXw : w P R k u is a closed convex cone. Then, for a projection mapping pp¨q onto K, we can write Since pp¨q is non-expansive as in Lemma 3 of [11], we have Hence, where the last inequality follows from the fact that XpX T Xq´1X T is a projection matrix. This completes the proof.
The following Corollary 1 shows theoretical properties of the proposed method.
M }r s j }, where r s j and w o j represent the jth columns of r 1 , 2 , 0s T and rW o We complete the proof of the corollary.
BecauseX pjq defined in the proof of Corollary 1 has a uniformly bounded condition number κ over j due to the fact that E j 's are bounded, Theorem 1 implies thatŴ 1 ,Ŵ 2 are as closed as the norm of the error to the underlying Corollary 1 also shows that the proposed estimator with an appropriately chosen ζ n has the bounded estimation error. The condition imposed on ζ n implies that the weight E jk is non-zero only when the underlying parameter satisfies rW Although the condition imposed on ζ n can not be verified in real data application, in theory, this condition always holds for ζ n in an appropriate range.

S3.2 Sensitivity analysis regarding to ζ n
We performed simulations to study the sensitivity of the results with respect to ζ n . We used the following model: In this model, data were simulated under n " 200, m " 50, and k " 5. All columns in H 1 were generated from the same Dirichlet distribution, while all columns in H 2 were from another Dirichlet distribution, i.e., h 1 " Dirpπ 2 q. For the concentration parameters π k , we used π 1 " π 2 " r1, 2, 3, 4, 5s.For W 1 and W 2 , we constructed a matrix W with i.i.d. N(0,2) entries, then randomly chose 5% entries, called Ω, such that the selected genes were differentially expressed for the two groups. Then, we set rW 1 s ij " rW 2 s ij " W ij for pi, jq R Ω, and rW 1 s ij " 0.5W ij and rW 1 s ij " 2W ij for pi, jq P Ω.For the ζ n value, we considered ζ n P t1, 2, 4, 8, 16, 32, 64, 128, 256, 512u. Tables S1 and S2 summarize the mean correlations of the estimated H k from different ζ n for k " 1, 2, respectively. We can see that the obtained H k s were quite robust with respect to ζ n values, demonstrating the robustness of the result when ζ n is in an appropriate range. Tables S3 and S4 show the relative difference norm of the estimate W k obtained from the different ζ n for k " 1, 2, respectively. For example, for the estimates obtained from ζ 1 and ζ 2 , we record }W k pζ 1 q´W k pζ 2 q}{p}W k pζ 1 q}`}W k pζ 2 q}q, where W k pζq is the output using the ζ value. We can observe that the relative errors are quite small, which implies that the results of W k s are robust to the choice of ζ n if it is chosen from an appropriate range.     Table S4: Relatively difference between W 2 s obtained from different ζ n values.
Next, we examine how ζ n might affect real data DEG outcomes. We ran SCADIE with ζ n " 2 and ζ n " 0.1, and plot volcano plots similar to Figure 5. As can be seen from Figure S7, the DEGs log2 fold-change and p-values are largely similar, which indicates our DEG results are robust if the parameter ζ n is in the appropriate range.

S3.3 Computational efficiency
We conducted simulations to investigate the computational cost of the proposed algorithm. We used the following model: To assess whether the proposed algorithm is even feasible to be run on the whole transcriptome, we set m P t500, 1000, 2000, 5000, 10000u, n " 100, and k " 5 to simulate data. The other model parameters are set as in Subsection S3.2. We conduct simulation 50 times and record its average minutes and the one standard deviations. We compare the proposed SCADIE with the conventional NMF method, which updates W 1 and W 2 separately via NNLS. As can be seen in Table S5, compared to the NNLS, SCADIE takes more time. But even for large m " 10000 case, the proposed algorithm takes less than 8 minutes on average, which implies that it may be feasible in such large data cases.

S3.4 Performances of SCADIE under different cell type-sample size ratios
As discussed in Section "DEG identification under poor initialization and limited sample size", SCADIE's performance might be limited when the sample size only marginally larger than the number of cell types. In this section we perform a comprehensive examination of sample size's effect on outcome quality through simulations.
We first conducted simulations to investigate how estimation performance of SCADIE depends on the underlying number of cell-types compared to the sample sizes. Given the number of samples n and the number of genes m, we assume that the underlying number of cell types k is less than minpn, mq. This is because if k ě minpn, mq, the columns of the factor matrix W or F is linearly dependent or full rank, which is undesired case in the factorization methods such as NMF. Note that one of the main goals of NMF is to reduce the number of parameters in the estimation procedure using kpn`mq parameters. If k ě minpn, mq, the number of estimated parameters is kpn`mq Á nm, which is the order of the number of components in the data Y .
Thus, in this simulation, for a given underlying number of cell types k " 5 and the number of genes 200, we consider n P t6, 10, 20, 50, 100, 200, 500, 1000u number of samples in SCADIE. The other model parameters are set as in Subsection S3.2. For the estimatesŴ k andĤ k , we record the relative error (RE) of theŴ k with respect to the underlying W k , i.e., }Ŵ k´Wk } F {}W k } F , and the average correlation of columns ofĤ k and H k . Table S6 reports the performances of SCADIE when the sample size n varies. It can be seen that SCADIE's output W s and Hs show robust high similarities with groundtruth when sample size equal or greater than 2x number of cell types.  Table S6: Performances of the estimates with respect to sample sizes.
To further investigate how these similarities reflect on DEG outcomes, we used the previous mouse ISC pseudo bulk data set to benchmark DEG identifications. There are four cell types in the dataset and we used sample sizes equal 6, 8, 12, 16, 20 in our simulations. As can be seen from Figure S8, both sensitivity and specificity increase with sample size, and when sample size equals 6, SCADIE's performance in up-regulated DEGs was not significantly better than random. Taken together, we recommend using SCADIE for DEG identification when sample size is at least 1.5x of number of cell types.

S3
.5 When the sample sizes n 1 and n 2 are different SCADIE gives the same weight to the first two terms in the objective function, i.e., w 1 " w 2 in w 1 }Y 1´W1 H 1 } 2 F`w 2 }Y 1´W1 H 1 } 2 F . One may take different values to w 1 and w 2 , e.g., depending on the sample sizes, one may set w 1 {w 2 " n 2 {n 1 . Specifically, we propose to use w 1 " n2{n1, w 2 " 1 for the first two terms if the ratio of n 1 and n2 is very different from 1. This is motivated by the fact that the first two terms can be understood as empirical risks, by averaging the loss function on the samples.
In this subsection, we conduct simulation analysis to investigate the performance of SCADIE and the weighted SCADIE by assigning different weights on the first two terms such that w 1 " n2{n1, w 2 " 1, when the ratio of the sample sizes of two groups vary. We set the number of samples n 1 " 100 and n 2 " rαn 1 s, where α P t 1 50 , 1 20 , 1 10 , 1 4 , 1 2 , 1, 2, 4, 10, 20, 50u.
The other model parameters are set as in Subsection S3.2. We conduct simulation 50 times and record average performances of SCADIE and the weighted SCADIE using the measures defined in Subsection S3.4. As can be seen in Table S7, for all the ratio values α, SCADIE well estimates H i and W i . Table S8 records the performance measures of the weighted SCADIE, which shows that weighted SCAIDE performs better than SCADIE when the ratio of two dimensions are very small or large, e.g., α P t 1 50 , 50u. However, when the ratio of two dimensions are moderate, the weighted SCADIE does not outperform SCADIE. Given that ratios of sample sizes in our real data analyses always between 1/10 and 10, assigning equal weights to the first two terms in the objective function in SCADIE seems to be reasonable.
It is worth noting that imposing another tuning parameter to the first or second term can improve the performances of SCADIE, but it will take additional time to tune the additional tuning parameter. We leave this issue for future research.

S3.6 Comparison of Jackknife and Bootstrap
We evaluated the jackknife standard error estimates by comparing them with bootstrap estimates on the simulated data described in Section "Methods -Simulation models and benchmarking" in the main text.
For bootstrapping, we chose a wide range for the number of resamplings: from half of sample size 10, to 20x sample size 400. And each random sample was generated by sampling 20 columns (which equals the observed sample size) from Y 1 , Y 2 and H 1 , H 2 , respectively. And the standard error Σ bs W1´W2 was obtained by empirical standard error across all samples.
The column-column correlations between jackknife Σ W1´W2 and bootstrap Σ bs W1´W2 s show extremely high concordances ( Figure S9). Next, we investigate the run time of jackknife and bootstrap. As can be seen from Figure S10, the run time is proportional to the number of SCADIE runs. Although jackknife and bootstrap show highly concordant empirical results, due to the sampling with replacement nature of bootstrap, there exists risk of higher noise due to singularity in Y s and Hs, i.e., the duplicated columns in sampled Y sample and H sample increases uncertainty in regression. In this regard, we recommend using jackknife for general purposes. In general, the bootstrap is more computationally intensive than Jackknife [35]. The Jackknife tends to perform better for confidence interval estimation for pairwise agreement measures and the Jackknife is more suitable for small original data samples [6].